Variables selected: Weekly Sales, Temperature, Unemployment rate and Fuel price
Creation of our dataset, which takes the mean across the 45 stores for each variable at each date of the original dataset. The new dataset is called "mean_data".
import pandas as pd
df = pd.read_csv('walmart-sales-dataset-of-45stores.csv')
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df_avg = df.groupby('Date').mean()
df_avg.to_csv('mean_data.csv')
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
def _plot_series1(df, col1, col2, series_name, series_index=0):
    fig, ax = plt.subplots(figsize=(10, 5.2), layout='constrained')
    df_sorted = df.sort_values(col1, ascending=True)
    palette = list(sns.palettes.mpl_palette('Dark2'))
    xs = df_sorted[col1]
    ys = df_sorted[col2]
    plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])
    sns.despine(fig=fig, ax=ax)
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.title(f'{col2} evolution in time')
def _plot_series3(df, col1, col2, col3, title, period='in time'):
    # `title` is kept for call compatibility; the f-string below sets the final title.
    fig, ax1 = plt.subplots(figsize=(10, 6))
    color1 = 'tab:red'
    ax1.set_xlabel(col1)
    ax1.set_ylabel(col2, color=color1)
    ax1.plot(df[col1], df[col2], label=col2, color=color1)
    ax1.tick_params(axis='y', labelcolor=color1)
    ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
    color2 = 'tab:blue'
    ax2.set_ylabel(col3, color=color2)
    ax2.plot(df[col1], df[col3], label=col3, color=color2)
    ax2.tick_params(axis='y', labelcolor=color2)
    sns.despine(ax=ax1, right=False)
    sns.despine(ax=ax2, left=True)
    # Combine the legends of both axes into one
    lines1, labels1 = ax1.get_legend_handles_labels()
    lines2, labels2 = ax2.get_legend_handles_labels()
    plt.legend(lines1 + lines2, labels1 + labels2, loc='upper left')
    plt.title(f'{col2} and {col3} evolution {period}')
    plt.show()
data = pd.read_csv('mean_data.csv')
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%m-%d')
Visualizations and analysis of the trend, noise and seasonality of each variable
Weekly Sales variable:
from statsmodels.tsa.seasonal import seasonal_decompose
data.set_index('Date', inplace=True)
# Plot the original time series data
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Weekly_Sales'], label='Original Data')
plt.title('Original Time Series Data')
plt.xlabel('Date')
plt.ylabel('Weekly_Sales')
plt.legend()
plt.show()
rolling_mean = data['Weekly_Sales'].rolling(window=12).mean()
# Plot original data and rolling mean
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Weekly_Sales'], label='Original Data')
plt.plot(data.index, rolling_mean, label='Rolling Mean', color='red')
plt.title('Original Data vs Rolling Mean (Trend)')
plt.xlabel('Date')
plt.ylabel('Weekly_Sales')
plt.legend()
plt.show()
detrended_data = data['Weekly_Sales'] - rolling_mean
decomposition = seasonal_decompose(data['Weekly_Sales'], model='additive', period=12)  # 12-week (~quarterly) seasonal period
# Plot trend, seasonality and noise of the variable
plt.figure(figsize=(10, 8))
plt.subplot(411)
plt.plot(data.index, decomposition.trend, label='Trend')
plt.title('Trend')
plt.xlabel('Date')
plt.ylabel('Weekly_Sales')
plt.legend()
plt.subplot(412)
plt.plot(data.index, decomposition.seasonal, label='Seasonality')
plt.title('Seasonality')
plt.xlabel('Date')
plt.ylabel('Weekly_Sales')
plt.legend()
plt.subplot(413)
plt.plot(data.index, decomposition.resid, label='Residuals')
plt.title('Noise')
plt.xlabel('Date')
plt.ylabel('Weekly_Sales')
plt.legend()
plt.tight_layout()
plt.show()
These visualizations reveal clear trend and seasonality patterns in the data. Weekly sales rise sharply at the end of each year, from October to January, and then stabilize for the remainder of the year, so the trend follows an annual cycle. Seasonality is also evident, with regular peaks roughly every two months. The residuals show a consistent pattern as well, with peaks recurring annually that contribute to the overall variability in the data. The rolling-mean graph reinforces this reading.
Fuel Prices variable:
from statsmodels.tsa.seasonal import seasonal_decompose
# Plot the original time series data
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Fuel_Price'], label='Original Data')
plt.title('Original Time Series Data')
plt.xlabel('Date')
plt.ylabel('Fuel_Price')
plt.legend()
plt.show()
rolling_mean = data['Fuel_Price'].rolling(window=12).mean()
# Plot original data and rolling mean
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Fuel_Price'], label='Original Data')
plt.plot(data.index, rolling_mean, label='Rolling Mean', color='red')
plt.title('Original Data vs Rolling Mean (Trend)')
plt.xlabel('Date')
plt.ylabel('Fuel_Price')
plt.legend()
plt.show()
detrended_data = data['Fuel_Price'] - rolling_mean
decomposition = seasonal_decompose(data['Fuel_Price'], model='additive', period=12)
# Plot trend, noise and seasonality of the variable
plt.figure(figsize=(10, 8))
plt.subplot(411)
plt.plot(data.index, decomposition.trend, label='Trend')
plt.title('Trend')
plt.xlabel('Date')
plt.ylabel('Fuel_Price')
plt.legend()
plt.subplot(412)
plt.plot(data.index, decomposition.seasonal, label='Seasonality')
plt.title('Seasonality')
plt.xlabel('Date')
plt.ylabel('Fuel_Price')
plt.legend()
plt.subplot(413)
plt.plot(data.index, decomposition.resid, label='Residuals')
plt.title('Noise')
plt.xlabel('Date')
plt.ylabel('Fuel_Price')
plt.legend()
plt.tight_layout()
plt.show()
The Fuel Prices variable shows a clear upward trend over the years (from about 2.7 in 2010 to 3.8 in 2012), with peaks recurring each May. The trend graph confirms this upward trajectory. A regular seasonality pattern is also discernible, with cycles recurring every three months. The noise associated with this variable, however, shows no discernible pattern.
Unemployment variable
from statsmodels.tsa.seasonal import seasonal_decompose
# Plot the original time series data
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Unemployment'], label='Original Data')
plt.title('Original Time Series Data')
plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.legend()
plt.show()
rolling_mean = data['Unemployment'].rolling(window=12).mean()
# Plot original data and rolling mean
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Unemployment'], label='Original Data')
plt.plot(data.index, rolling_mean, label='Rolling Mean', color='red')
plt.title('Original Data vs Rolling Mean (Trend)')
plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.legend()
plt.show()
detrended_data = data['Unemployment'] - rolling_mean
decomposition = seasonal_decompose(data['Unemployment'], model='additive', period=12)
# Plot trend, noise and seasonality of the variable
plt.figure(figsize=(10, 8))
plt.subplot(411)
plt.plot(data.index, decomposition.trend, label='Trend')
plt.title('Trend')
plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.legend()
plt.subplot(412)
plt.plot(data.index, decomposition.seasonal, label='Seasonality')
plt.title('Seasonality')
plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.legend()
plt.subplot(413)
plt.plot(data.index, decomposition.resid, label='Residuals')
plt.title('Noise')
plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.legend()
plt.tight_layout()
plt.show()
Examining the graph depicting the Unemployment rate over time, we observe a consistent decline from 8.75 in 2010 to 7 in 2012, resembling a staircase-like descent pattern. Both the trend graph and the rolling mean graph reinforce this observation, illustrating a distinct downward trend over time. Moreover, the seasonality graph reveals a recurring pattern every three months, while the noise appears to be irregular.
Temperature variable
from statsmodels.tsa.seasonal import seasonal_decompose
# Plot the original time series data
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Temperature'], label='Original Data')
plt.title('Original Time Series Data')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.legend()
plt.show()
rolling_mean = data['Temperature'].rolling(window=12).mean()
# Plot original data and rolling mean
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Temperature'], label='Original Data')
plt.plot(data.index, rolling_mean, label='Rolling Mean', color='red')
plt.title('Original Data vs Rolling Mean (Trend)')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.legend()
plt.show()
detrended_data = data['Temperature'] - rolling_mean
decomposition = seasonal_decompose(data['Temperature'], model='additive', period=12)
# Plot noise, trend, seasonality of the variable
plt.figure(figsize=(10, 8))
plt.subplot(411)
plt.plot(data.index, decomposition.trend, label='Trend')
plt.title('Trend')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.legend()
plt.subplot(412)
plt.plot(data.index, decomposition.seasonal, label='Seasonality')
plt.title('Seasonality')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.legend()
plt.subplot(413)
plt.plot(data.index, decomposition.resid, label='Residuals')
plt.title('Noise')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.legend()
plt.tight_layout()
plt.show()
Analyzing our dataset reveals a pronounced annual pattern, with temperatures rising from February to August and declining from August to January, consistent with our expectations. The trend graph illustrates this annual pattern, while the seasonality graph depicts a three-month cycle. However, the noise graph does not display any discernible regular pattern in our data.
Analysis of the influence of each variable on the Weekly Sales
Impact of holidays on the weekly sales
data = pd.read_csv('mean_data.csv')
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%m-%d')
_plot_series3(data, 'Date', 'Weekly_Sales', 'Holiday_Flag', '')
This graph unmistakably illustrates the influence of holidays on our weekly sales, characterized by a substantial increase during holiday periods, aligning with our expectations. Consequently, we can infer that holidays exert a significant impact on weekly sales.
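The holiday uplift described above can also be quantified directly by grouping on `Holiday_Flag`; on the real data this would be `data.groupby('Holiday_Flag')['Weekly_Sales'].mean()`. A minimal sketch on a toy frame (the numbers are made up for illustration):

```python
import pandas as pd

# Toy stand-in for mean_data: Holiday_Flag is 1 for holiday weeks.
df = pd.DataFrame({
    'Holiday_Flag': [0, 0, 1, 0, 1, 0],
    'Weekly_Sales': [1000.0, 1050.0, 1400.0, 980.0, 1500.0, 1020.0],
})

# Average sales per flag value; the difference is the holiday uplift.
means = df.groupby('Holiday_Flag')['Weekly_Sales'].mean()
uplift = means[1] - means[0]
print(means[0], means[1], uplift)  # 1012.5 1450.0 437.5
```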
Impact of the temperature on the weekly sales
_plot_series3(data, 'Date', 'Weekly_Sales', 'Temperature', '')
_plot_series1(data, 'Temperature', 'Weekly_Sales','')
These graphs reveal a notable relationship between the Temperature variable and our Weekly Sales variable. In particular, for high temperatures (exceeding roughly 50), weekly sales decrease. The first graph shows an inverse relationship between temperature and weekly sales: as temperatures rise, weekly sales remain low and stable, whereas drops in temperature coincide with spikes in weekly sales. One plausible explanation is the tendency for people in the USA to reduce outings during periods of elevated temperatures.
Impact of the fuel prices on the weekly sales
_plot_series3(data, 'Date', 'Weekly_Sales', 'Fuel_Price', '')
_plot_series1(data, 'Fuel_Price', 'Weekly_Sales','')
While this graph doesn't depict a strong correlation between weekly sales and fuel prices, it does suggest that when fuel prices are at their highest during the year, weekly sales remain low and stable. Conversely, when there's a decrease in fuel prices, we observe a spike in weekly sales.
Impact of the unemployment rate on the weekly sales
_plot_series3(data, 'Date', 'Weekly_Sales', 'Unemployment', '')
According to the visualization we observed, there doesn't appear to be a correlation between the unemployment rate and weekly sales.
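The visual impressions above can be complemented with a correlation coefficient; on the real data this would be `data[['Weekly_Sales', 'Temperature', 'Fuel_Price', 'Unemployment']].corr()`. A sketch on synthetic data with a negative temperature effect built in (all values are illustrative, not the Walmart data):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: sales constructed to fall as temperature rises.
rng = np.random.default_rng(1)
temp = rng.uniform(20, 90, 200)
sales = 2000.0 - 5.0 * temp + rng.normal(0.0, 30.0, 200)
df = pd.DataFrame({'Temperature': temp, 'Weekly_Sales': sales})

# Pearson correlation; strongly negative by construction.
corr = df['Temperature'].corr(df['Weekly_Sales'])
print(round(corr, 2))
```

A value near 0 for `Unemployment` would back up the "no correlation" reading, while a clearly negative value for `Temperature` would back up the inverse relationship described above.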
from prophet import Prophet
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
Prophet model
Weekly_Sales
df = pd.read_csv('mean_data.csv')
df_avg_sales = df.groupby(['Date'])['Weekly_Sales'].mean().reset_index()
df_avg_sales.rename(columns={'Date': 'ds', 'Weekly_Sales': 'y'}, inplace=True)
df_avg_sales['ds'] = pd.to_datetime(df_avg_sales['ds'])
model = Prophet()
model_fit = model.fit(df_avg_sales)
INFO:prophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
future_dates = model_fit.make_future_dataframe(periods=90)
forecast = model_fit.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales['ds'], df_avg_sales['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')
plt.xlabel('Date')
plt.ylabel('Weekly Sales Value')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
model = Prophet()
model.fit(df_avg_sales)
future_dates = model.make_future_dataframe(periods=365*2)
forecast = model.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales['ds'], df_avg_sales['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')
plt.xlabel('Date')
plt.ylabel('Weekly Sales Value')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
forecast = model.predict(df_avg_sales)
future = model.make_future_dataframe(periods=365*2)
forecast = model.predict(future)
fig=model.plot(forecast)
fig.gca().set_xlabel("Date")
fig.gca().set_ylabel("Weekly Sales Value")
observed_values = df_avg_sales['y']
predicted_values = forecast['yhat']
predicted_trimmed = predicted_values.iloc[:len(observed_values)]
mae = mean_absolute_error(observed_values, predicted_trimmed)
mse = mean_squared_error(observed_values, predicted_trimmed)
rmse = np.sqrt(mse)
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 45093.7644299397
Mean Squared Error (MSE): 6039856547.081733
Root Mean Squared Error (RMSE): 77716.51399208364
Temperature
df = pd.read_csv('mean_data.csv')
df_avg_sales_Temperature = df.groupby(['Date'])['Temperature'].mean().reset_index()
df_avg_sales_Temperature.rename(columns={'Date': 'ds', 'Temperature': 'y'}, inplace=True)
df_avg_sales_Temperature['ds'] = pd.to_datetime(df_avg_sales_Temperature['ds'])
model = Prophet()
model_fit = model.fit(df_avg_sales_Temperature)
future_dates = model_fit.make_future_dataframe(periods=90)
forecast = model_fit.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales_Temperature['ds'], df_avg_sales_Temperature['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
model = Prophet()
model.fit(df_avg_sales_Temperature)
future_dates = model.make_future_dataframe(periods=365*3)
forecast = model.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales_Temperature['ds'], df_avg_sales_Temperature['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
forecast = model.predict(df_avg_sales_Temperature)
future = model.make_future_dataframe(periods=365*3)
forecast = model.predict(future)
fig=model.plot(forecast)
fig.gca().set_xlabel("Date")
fig.gca().set_ylabel("Temperature")
observed_values = df_avg_sales_Temperature['y']
predicted_values = forecast['yhat']
predicted_trimmed = predicted_values.iloc[:len(observed_values)]
mae = mean_absolute_error(observed_values, predicted_trimmed)
mse = mean_squared_error(observed_values, predicted_trimmed)
rmse = np.sqrt(mse)
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 1.6586769187551424
Mean Squared Error (MSE): 4.952063967332992
Root Mean Squared Error (RMSE): 2.225323339951521
Unemployment
df = pd.read_csv('mean_data.csv')
df_avg_sales_Unemployment = df.groupby(['Date'])['Unemployment'].mean().reset_index()
df_avg_sales_Unemployment.rename(columns={'Date': 'ds', 'Unemployment': 'y'}, inplace=True)
df_avg_sales_Unemployment['ds'] = pd.to_datetime(df_avg_sales_Unemployment['ds'])
model = Prophet()
model_fit = model.fit(df_avg_sales_Unemployment)
future_dates = model_fit.make_future_dataframe(periods=90)
forecast = model_fit.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales_Unemployment['ds'], df_avg_sales_Unemployment['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')
plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
model = Prophet()
model.fit(df_avg_sales_Unemployment)
future_dates = model.make_future_dataframe(periods=365*3)
forecast = model.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales_Unemployment['ds'], df_avg_sales_Unemployment['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')
plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
forecast = model.predict(df_avg_sales_Unemployment)
future = model.make_future_dataframe(periods=365*3)
forecast = model.predict(future)
fig=model.plot(forecast)
fig.gca().set_xlabel("Date")
fig.gca().set_ylabel("Unemployment")
observed_values = df_avg_sales_Unemployment['y']
predicted_values = forecast['yhat']
predicted_trimmed = predicted_values.iloc[:len(observed_values)]
mae = mean_absolute_error(observed_values, predicted_trimmed)
mse = mean_squared_error(observed_values, predicted_trimmed)
rmse = np.sqrt(mse)
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 0.023509013836708368
Mean Squared Error (MSE): 0.0011406637919996426
Root Mean Squared Error (RMSE): 0.033773714512911404
Fuel_Price
df = pd.read_csv('mean_data.csv')
df_avg_sales_Fuel = df.groupby(['Date'])['Fuel_Price'].mean().reset_index()
df_avg_sales_Fuel.rename(columns={'Date': 'ds', 'Fuel_Price': 'y'}, inplace=True)
df_avg_sales_Fuel['ds'] = pd.to_datetime(df_avg_sales_Fuel['ds'])
model = Prophet()
model_fit = model.fit(df_avg_sales_Fuel)
future_dates = model_fit.make_future_dataframe(periods=90)
forecast = model_fit.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales_Fuel['ds'], df_avg_sales_Fuel['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')
plt.xlabel('Date')
plt.ylabel('Fuel Price')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
model = Prophet()
model.fit(df_avg_sales_Fuel)
future_dates = model.make_future_dataframe(periods=365*3)
forecast = model.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales_Fuel['ds'], df_avg_sales_Fuel['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')
plt.xlabel('Date')
plt.ylabel('Fuel Price')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
forecast = model.predict(df_avg_sales_Fuel)
future = model.make_future_dataframe(periods=365*3)
forecast = model.predict(future)
fig=model.plot(forecast)
fig.gca().set_xlabel("Date")
fig.gca().set_ylabel("Fuel Price")
observed_values = df_avg_sales_Fuel['y']
predicted_values = forecast['yhat']
predicted_trimmed = predicted_values.iloc[:len(observed_values)]
mae = mean_absolute_error(observed_values, predicted_trimmed)
mse = mean_squared_error(observed_values, predicted_trimmed)
rmse = np.sqrt(mse)
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 0.05872585431994161
Mean Squared Error (MSE): 0.005055406525859416
Root Mean Squared Error (RMSE): 0.07110138202496077
Explanation of our Prophet model:
We opted to employ the Prophet model because it is a robust tool specifically designed for time series with multiple seasonalities, missing data, and outliers. It offers flexibility in modeling holidays and trend changes, which makes it well suited to retail sales forecasting. Prophet automatically detects changepoints in the trend and integrates them into its forecasts; its parameters include the seasonality mode, holidays, and regularization settings. The model lets us visualize the observed data alongside its fitted trend and seasonality, and when we use it to forecast future values we also obtain a forecast of the trend and seasonality of each variable. In our code we forecast two to three years ahead, depending on the variable.
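Prophet's changepoint handling can be illustrated with a small numpy sketch of the piecewise-linear trend it fits: a base rate k plus rate adjustments delta that switch on after each changepoint s. All numbers below are made up for illustration, not fitted values:

```python
import numpy as np

# Piecewise-linear trend g(t) = (k + a(t)·delta) * t + (m + a(t)·gamma),
# with gamma_j = -s_j * delta_j so the trend stays continuous at each
# changepoint. Changepoint locations and deltas here are invented.
t = np.arange(10, dtype=float)
k, m = 1.0, 0.0                        # base growth rate and offset
changepoints = np.array([3.0, 7.0])    # times where the rate changes
deltas = np.array([0.5, -1.0])         # rate adjustments

A = (t[:, None] >= changepoints[None, :]).astype(float)  # a(t) indicators
trend = (k + A @ deltas) * t + (m + A @ (-changepoints * deltas))
print(trend)  # rises faster after t=3, flattens after t=7
```

Prophet estimates the changepoint locations and deltas from the data, sparsifying the deltas through regularization.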
After applying this model to our selected variables, we computed Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) so that we can compare its performance with the SARIMA model.
SARIMA Model
Fuel Price
df = pd.read_csv('mean_data.csv')
df_avg_sales_FP = df.groupby(['Date'])['Fuel_Price'].mean()
rng = pd.date_range(start='2010-02-05', end='2012-10-31', freq='W')  # weekly index spanning the data
df_avg_sales_FP = pd.DataFrame(df_avg_sales_FP)
df_avg_sales_FP.set_index(rng, inplace=True)
df2 = df_avg_sales_FP[["Fuel_Price"]]  # already carries the weekly DatetimeIndex
df2.plot(figsize=(30,10), linewidth=5, fontsize=20)
plt.title("Plot of Fuel_Price", fontsize=30)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Fuel_Price', fontsize=20)
plt.show()
import statsmodels.api as sm
ordre_sarima = (1, 0, 0)  # Non-seasonal order (p, d, q)
# Seasonal order (P, D, Q, s)
ordre_saisonnier_1 = (0, 1, 0, 12)
ordre_saisonnier_2 = (1, 1, 1, 12)
ordre_saisonnier_3 = (0, 1, 1, 12)
ordre_saisonnier_4 = (0, 0, 1, 12)
ordre_saisonnier_5 = (1, 1, 0, 12)
modele_sarima_3 = sm.tsa.SARIMAX(df2, order=ordre_sarima, seasonal_order=ordre_saisonnier_3, exog=None)
modele_sarima_3_fit = modele_sarima_3.fit(disp=False)
print(modele_sarima_3_fit.summary())
modele_sarima_3_fit.plot_diagnostics(lags=20, figsize=(15, 12))
plt.show()
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/sarimax.py:1009: UserWarning: Non-invertible starting seasonal moving average. Using zeros as starting parameters.
SARIMAX Results
============================================================================================
Dep. Variable: Fuel_Price No. Observations: 143
Model: SARIMAX(1, 0, 0)x(0, 1, [1], 12) Log Likelihood 200.211
Date: Thu, 09 May 2024 AIC -394.423
Time: 17:04:07 BIC -385.797
Sample: 02-07-2010 HQIC -390.918
- 10-28-2012
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.9993 0.318 3.142 0.002 0.376 1.623
ma.S.L12 -0.9982 9.140 -0.109 0.913 -18.913 16.916
sigma2 0.0022 0.020 0.113 0.910 -0.036 0.041
===================================================================================
Ljung-Box (L1) (Q): 52.88 Jarque-Bera (JB): 0.30
Prob(Q): 0.00 Prob(JB): 0.86
Heteroskedasticity (H): 2.54 Skew: 0.10
Prob(H) (two-sided): 0.00 Kurtosis: 2.88
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
forecast = modele_sarima_3_fit.get_forecast(steps=143)
forecast_values = forecast.predicted_mean
forecast_conf_int = forecast.conf_int()
plt.figure(figsize=(12, 6))
plt.plot(df2, label='Original Series')
plt.plot(forecast_values.index, forecast_values.values, color='red', label='Forecast')
plt.fill_between(forecast_conf_int.index, forecast_conf_int.iloc[:, 0], forecast_conf_int.iloc[:, 1], color='pink', alpha=0.3, label='Confidence Interval')
plt.xlabel('Date')
plt.ylabel('Fuel Price')
plt.title('Original Series and Forecast')
plt.legend()
plt.grid(True)
plt.show()
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
mae = mean_absolute_error(df2, forecast_values)
mse = mean_squared_error(df2, forecast_values)
rmse = np.sqrt(mse)
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 0.4633197017480666
Mean Squared Error (MSE): 0.3640109180565085
Root Mean Squared Error (RMSE): 0.6033331733433099
Weekly_Sales
df = pd.read_csv('mean_data.csv')
df_avg_sales_WS = df.groupby(['Date'])['Weekly_Sales'].mean()
rng = pd.date_range(start='02-05-2010', end='10-31-2012', freq='W')
df_avg_sales_WS = pd.DataFrame(df_avg_sales_WS)
df_avg_sales_WS.set_index(rng, inplace=True)
df2 = df_avg_sales_WS[["Weekly_Sales"]]
df2.set_index(rng, inplace=True)
df2.plot(figsize=(30,10), linewidth=5, fontsize=20)
plt.title("Plot of Weekly_Sales", fontsize=30)
plt.xlabel('Year', fontsize=20)
plt.ylabel('Weekly_Sales', fontsize=20)
plt.show()
ordre_sarima = (1, 0, 0)
# Seasonal order (P, D, Q, s)
ordre_saisonnier_1 = (0, 1, 0, 12)
ordre_saisonnier_2 = (1, 1, 1, 12)
ordre_saisonnier_3 = (0, 1, 1, 12)
ordre_saisonnier_4 = (0, 0, 1, 12)
ordre_saisonnier_5 = (1, 0, 2, 12)
modele_sarima_3 = sm.tsa.SARIMAX(df2, order=ordre_sarima, seasonal_order=ordre_saisonnier_3, exog=None)
modele_sarima_3_fit = modele_sarima_3.fit(disp=False)
print(modele_sarima_3_fit.summary())
modele_sarima_3_fit.plot_diagnostics(lags=20, figsize=(15, 12))
plt.show()
SARIMAX Results
============================================================================================
Dep. Variable: Weekly_Sales No. Observations: 143
Model: SARIMAX(1, 0, 0)x(0, 1, [1], 12) Log Likelihood -1755.118
Date: Thu, 09 May 2024 AIC 3516.237
Time: 17:04:11 BIC 3524.862
Sample: 02-07-2010 HQIC 3519.742
- 10-28-2012
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.6550 0.054 12.052 0.000 0.548 0.762
ma.S.L12 -0.6707 0.135 -4.963 0.000 -0.936 -0.406
sigma2 2.98e+10 6.9e-13 4.32e+22 0.000 2.98e+10 2.98e+10
===================================================================================
Ljung-Box (L1) (Q): 3.33 Jarque-Bera (JB): 84.94
Prob(Q): 0.07 Prob(JB): 0.00
Heteroskedasticity (H): 0.25 Skew: 0.39
Prob(H) (two-sided): 0.00 Kurtosis: 6.87
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 2.98e+38. Standard errors may be unstable.
forecast = modele_sarima_3_fit.get_forecast(steps=143)
forecast_values = forecast.predicted_mean
forecast_conf_int = forecast.conf_int()
plt.figure(figsize=(12, 6))
plt.plot(df2, label='Original Series')
plt.plot(forecast_values.index, forecast_values.values, color='red', label='Forecast')
plt.fill_between(forecast_conf_int.index, forecast_conf_int.iloc[:, 0], forecast_conf_int.iloc[:, 1], color='pink', alpha=0.3, label='Confidence Interval')
plt.xlabel('Date')
plt.ylabel('Weekly Sales')
plt.title('Original Series and Forecast')
plt.legend()
plt.grid(True)
plt.show()
mae = mean_absolute_error(df2, forecast_values)
mse = mean_squared_error(df2, forecast_values)
rmse = np.sqrt(mse)
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 70117.34550314884
Mean Squared Error (MSE): 16009144406.13689
Root Mean Squared Error (RMSE): 126527.2476826114
Temperature
df = pd.read_csv('mean_data.csv')
df_avg_sales_T = df.groupby(['Date'])['Temperature'].mean()
rng = pd.date_range(start='02-05-2010', end='10-31-2012', freq='W')
df_avg_sales_T = pd.DataFrame(df_avg_sales_T)
df_avg_sales_T.set_index(rng, inplace=True)
df2 = df_avg_sales_T[["Temperature"]]
df2.set_index(rng, inplace=True)
df2.plot(figsize=(30,10), linewidth=5, fontsize=20)
plt.title("Plot of Temperatures", fontsize=30)
plt.xlabel('Year', fontsize=20)
plt.ylabel('Temperatures', fontsize=20)
plt.show()
ordre_sarima = (1, 0, 0)
# Seasonal order (P, D, Q, s)
ordre_saisonnier_1 = (0, 1, 0, 12)
ordre_saisonnier_2 = (1, 1, 1, 12)
ordre_saisonnier_3 = (0, 1, 1, 12)
ordre_saisonnier_4 = (0, 0, 1, 12)
ordre_saisonnier_5 = (1, 1, 0, 12)
modele_sarima_3 = sm.tsa.SARIMAX(df2, order=ordre_sarima, seasonal_order=ordre_saisonnier_5, exog=None)
modele_sarima_3_fit = modele_sarima_3.fit(disp=False)
print(modele_sarima_3_fit.summary())
modele_sarima_3_fit.plot_diagnostics(lags=20, figsize=(15, 12))
plt.show()
SARIMAX Results
==========================================================================================
Dep. Variable: Temperature No. Observations: 143
Model: SARIMAX(1, 0, 0)x(1, 1, 0, 12) Log Likelihood -381.213
Date: Thu, 09 May 2024 AIC 768.426
Time: 17:04:15 BIC 777.051
Sample: 02-07-2010 HQIC 771.931
- 10-28-2012
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.9812 0.020 50.213 0.000 0.943 1.020
ar.S.L12 -0.3216 0.101 -3.196 0.001 -0.519 -0.124
sigma2 19.1196 2.397 7.977 0.000 14.422 23.818
===================================================================================
Ljung-Box (L1) (Q): 4.22 Jarque-Bera (JB): 0.50
Prob(Q): 0.04 Prob(JB): 0.78
Heteroskedasticity (H): 0.96 Skew: 0.15
Prob(H) (two-sided): 0.90 Kurtosis: 3.06
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
forecast = modele_sarima_3_fit.get_forecast(steps=143)
forecast_values = forecast.predicted_mean
forecast_conf_int = forecast.conf_int()
plt.figure(figsize=(12, 6))
plt.plot(df2, label='Original Series')
plt.plot(forecast_values.index, forecast_values.values, color='red', label='Forecast')
plt.fill_between(forecast_conf_int.index, forecast_conf_int.iloc[:, 0], forecast_conf_int.iloc[:, 1], color='pink', alpha=0.3, label='Confidence Interval')
plt.xlabel('Date')
plt.ylabel('Temperatures')
plt.title('Original Series and Forecast')
plt.legend()
plt.grid(True)
plt.show()
mae = mean_absolute_error(df2, forecast_values)
mse = mean_squared_error(df2, forecast_values)
rmse = np.sqrt(mse)
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 25.442921157096524
Mean Squared Error (MSE): 885.1674497372217
Root Mean Squared Error (RMSE): 29.751763808843698
Unemployment
df = pd.read_csv('mean_data.csv')
df_avg_sales_U = df.groupby(['Date'])['Unemployment'].mean()
rng = pd.date_range(start='02-05-2010', end='10-31-2012', freq='W')
df_avg_sales_U = pd.DataFrame(df_avg_sales_U)
df_avg_sales_U.set_index(rng, inplace=True)
df2 = df_avg_sales_U[["Unemployment"]]
df2.set_index(rng, inplace=True)
df2.plot(figsize=(30,10), linewidth=5, fontsize=20)
plt.title("Plot of Unemployment", fontsize=30)
plt.xlabel('Year', fontsize=20)
plt.ylabel('Unemployment', fontsize=20)
plt.show()
ordre_sarima = (1, 0, 0)
# Seasonal order (P, D, Q, s)
ordre_saisonnier_1 = (0, 1, 0, 12)
ordre_saisonnier_2 = (1, 1, 1, 12)
ordre_saisonnier_3 = (0, 1, 1, 12)
ordre_saisonnier_4 = (0, 0, 1, 12)
ordre_saisonnier_5 = (1, 1, 0, 12)
modele_sarima_3 = sm.tsa.SARIMAX(df2, order=ordre_sarima, seasonal_order=ordre_saisonnier_2, exog=None)
modele_sarima_3_fit = modele_sarima_3.fit(disp=False)
print(modele_sarima_3_fit.summary())
modele_sarima_3_fit.plot_diagnostics(lags=20, figsize=(15, 12))
plt.show()
ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals.
SARIMAX Results
============================================================================================
Dep. Variable: Unemployment No. Observations: 143
Model: SARIMAX(1, 0, 0)x(1, 1, [1], 12) Log Likelihood 188.641
Date: Thu, 09 May 2024 AIC -369.281
Time: 17:04:24 BIC -357.780
Sample: 02-07-2010 HQIC -364.608
- 10-28-2012
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.9988 0.011 94.137 0.000 0.978 1.020
ar.S.L12 0.0781 0.166 0.471 0.637 -0.246 0.403
ma.S.L12 -0.8993 0.303 -2.965 0.003 -1.494 -0.305
sigma2 0.0028 0.001 5.624 0.000 0.002 0.004
===================================================================================
Ljung-Box (L1) (Q): 1.65 Jarque-Bera (JB): 1938.35
Prob(Q): 0.20 Prob(JB): 0.00
Heteroskedasticity (H): 4.66 Skew: -3.85
Prob(H) (two-sided): 0.00 Kurtosis: 20.20
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
forecast = modele_sarima_3_fit.get_forecast(steps=143)
forecast_values = forecast.predicted_mean
forecast_conf_int = forecast.conf_int()
plt.figure(figsize=(12, 6))
plt.plot(df2, label='Original Series')
plt.plot(forecast_values.index, forecast_values.values, color='red', label='Forecast')
plt.fill_between(forecast_conf_int.index, forecast_conf_int.iloc[:, 0], forecast_conf_int.iloc[:, 1], color='pink', alpha=0.3, label='Confidence Interval')
plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.title('Original Series and Forecast')
plt.legend()
plt.grid(True)
plt.show()
mae = mean_absolute_error(df2, forecast_values)
mse = mean_squared_error(df2, forecast_values)
rmse = np.sqrt(mse)
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 1.7637102946401182
Mean Squared Error (MSE): 3.133634990476137
Root Mean Squared Error (RMSE): 1.7702076122523418
Explanation of our SARIMA model:
To fit the SARIMA models, appropriate orders for the autoregressive, differencing, moving average, and seasonal components had to be selected. This process involved running multiple fits with different parameter values while analyzing the autocorrelation plots, normal Q-Q plots, AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) values, and the density histograms obtained from the diagnostics we ran. Additionally, custom code was developed to identify the optimal parameter combination for each variable.
Based on this analysis, the best-fitting models were: SARIMA(1,0,0)(0,1,1,12) for the Fuel Price and Weekly Sales variables, SARIMA(1,0,0)(1,1,0,12) for the Temperature variable, and SARIMA(1,0,0)(1,1,1,12) for the Unemployment variable. The seasonal period of 12 was chosen based on the yearly patterns identified in the data (part 1).
Comparison between the models and their forecasts:
Based on our analysis of the visualizations, model diagnostics, and metrics, the Prophet model outperforms the SARIMA model for each of our variables. Its forecasts align more closely with the observed data and with the identified trends and seasonality. This conclusion is further supported by the computed metrics, which consistently show lower Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) values for the Prophet model than for SARIMA.
For instance, when examining the Unemployment variable, the Prophet model yielded the following metrics: Mean Absolute Error (MAE): 0.023, Mean Squared Error (MSE): 0.001, and Root Mean Squared Error (RMSE): 0.033. In contrast, the SARIMA model produced significantly higher scores: Mean Absolute Error (MAE): 1.763, Mean Squared Error (MSE): 3.133, and Root Mean Squared Error (RMSE): 1.770. This notable disparity in results strongly favors the Prophet model.
Here we add an inflation variable, exogenous to the dataset, that impacts the unemployment rate and the fuel price of the original dataset.
new_df = pd.read_csv("DATA.csv")
new_df = new_df[['DATE', 'INFLATION(%)']]
new_df.rename(columns={'DATE': 'Date'}, inplace=True)
new_df.rename(columns={'INFLATION(%)': 'Inflation'}, inplace=True)
Select the rows of the new dataset that correspond to the dates of the original one.
new_df['Date'] = pd.to_datetime(new_df['Date'], format='%d-%m-%Y')
new_df = new_df[(new_df['Date'] >= '2010-02-01') & (new_df['Date'] <= '2012-10-26')]
Visualizations of our exogenous variable:
_plot_series1(new_df, 'Date', 'Inflation', '')
Here, we chose an outer join to retain the maximum amount of information from both sides. Each source recorded data at different times, so an outer join ensures a complete timeline even when one source has no observation at a given date.
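As a toy illustration of this behavior (made-up values, not the real files), an outer join keeps the union of the dates from both sources:

```python
import pandas as pd

# Two small frames whose date sets only partially overlap (illustrative values)
sales = pd.DataFrame({'Date': pd.to_datetime(['2010-02-05', '2010-02-12']),
                      'Weekly_Sales': [1_105_572.0, 1_074_148.0]})
inflation = pd.DataFrame({'Date': pd.to_datetime(['2010-02-01']),
                          'Inflation': [2.14]})

merged = pd.merge(sales, inflation, on='Date', how='outer').sort_values('Date')
print(len(merged))                       # 3 rows: the union of both date sets
print(merged['Inflation'].isna().sum())  # 2 dates carry no inflation record
```

An inner join would instead have dropped every date present in only one source, here leaving an empty frame.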
df_merged = pd.merge(data, new_df, on='Date', how='outer').sort_values(by='Date')
df_merged
| | Date | Store | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment | Inflation |
|---|---|---|---|---|---|---|---|---|---|
| 171 | 2010-02-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.143332 |
| 0 | 2010-02-05 | 23.0 | 1.105572e+06 | 0.0 | 34.037333 | 2.717844 | 167.730885 | 8.619311 | NaN |
| 1 | 2010-02-12 | 23.0 | 1.074148e+06 | 1.0 | 34.151333 | 2.694022 | 167.825608 | 8.619311 | NaN |
| 2 | 2010-02-19 | 23.0 | 1.072822e+06 | 0.0 | 37.719778 | 2.672067 | 167.871686 | 8.619311 | NaN |
| 3 | 2010-02-26 | 23.0 | 9.770794e+05 | 0.0 | 39.243556 | 2.683933 | 167.909657 | 8.619311 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 143 | 2012-10-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.162344 |
| 139 | 2012-10-05 | 23.0 | 1.057036e+06 | 0.0 | 65.973111 | 3.845222 | 176.505052 | 6.953711 | NaN |
| 140 | 2012-10-12 | 23.0 | 1.025078e+06 | 0.0 | 58.342667 | 3.896733 | 176.636515 | 6.953711 | NaN |
| 141 | 2012-10-19 | 23.0 | 1.002720e+06 | 0.0 | 60.705333 | 3.880000 | 176.652613 | 6.953711 | NaN |
| 142 | 2012-10-26 | 23.0 | 1.012091e+06 | 0.0 | 61.051111 | 3.791489 | 176.649482 | 6.953711 | NaN |
172 rows × 9 columns
As a result of the outer join, we have to deal with several NaN values in the merged dataset. We used linear interpolation to replace these NaN values; this method provides an accurate way to estimate missing values by following the trend in the data.
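A minimal illustration of how linear interpolation fills such gaps from the neighbouring observations:

```python
import numpy as np
import pandas as pd

# Two interior NaNs are filled on the straight line between 2.0 and 5.0
s = pd.Series([2.0, np.nan, np.nan, 5.0])
print(s.interpolate(method='linear').tolist())  # [2.0, 3.0, 4.0, 5.0]
```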
df_merged_inter = df_merged.copy()
df_merged_inter.interpolate(method='linear', inplace=True)
_plot_series3(df_merged_inter, 'Date', 'Inflation', 'Unemployment', '', 'between 2010-02 and 2012-10')
df_merged_un = df_merged[(df_merged['Date'] >= '2010-06-01') & (df_merged['Date'] <= '2011-09-01')].copy()  # .copy() avoids SettingWithCopyWarning when interpolating the slice
df_merged_un.interpolate(method='linear', inplace=True)
_plot_series3(df_merged_un, 'Date', 'Inflation', 'Unemployment', '', 'between 2010-06 and 2011-09')
Impact of inflation on unemployment:
The relationship between inflation and unemployment is traditionally illustrated by the Phillips curve. This economic model suggests an inverse relationship between the unemployment rate and the inflation rate: when inflation is high, unemployment tends to be lower, and vice versa. We can observe this phenomenon over the whole timeline of our dataset, but especially over the period between 2010-06 and 2011-09.
_plot_series3(df_merged_inter, 'Date', 'Inflation', 'Fuel_Price', '', 'between 2010-02 and 2012-10')
df_merged_fp = df_merged[(df_merged['Date'] >= '2010-05-01') & (df_merged['Date'] <= '2012-01-01')].copy()  # .copy() avoids SettingWithCopyWarning when interpolating the slice
df_merged_fp.interpolate(method='linear', inplace=True)
_plot_series3(df_merged_fp, 'Date', 'Inflation', 'Fuel_Price', '', 'between 2010-05 and 2012-01')
Impact of inflation on fuel price:
Inflation means that the general level of prices for goods and services in an economy is rising, so it has a direct impact on the fuel price: the more inflation increases, the more the fuel price increases. We can observe this phenomenon over the whole timeline of our dataset, but especially over the period between 2010-05 and 2012-01.
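The direction of both relationships can be sanity-checked with a simple correlation; the values below are toy numbers mimicking the described patterns, not the actual series:

```python
import pandas as pd

# Toy values: fuel price rising with inflation, unemployment falling with it
df = pd.DataFrame({'Inflation': [1.0, 2.0, 3.0, 4.0],
                   'Fuel_Price': [2.5, 2.9, 3.4, 3.8],
                   'Unemployment': [8.6, 8.3, 8.0, 7.6]})

print(df['Inflation'].corr(df['Fuel_Price']) > 0)    # True: positive correlation
print(df['Inflation'].corr(df['Unemployment']) < 0)  # True: negative correlation
```

The regression models fitted below quantify the same directions on the real data.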
df_merged_fp = df_merged_fp.dropna(subset=['Inflation', 'Fuel_Price'])
df_merged_un = df_merged_un.dropna(subset=['Inflation', 'Unemployment'])
Creation of a regression model to better identify the relation between our exogenous variable (inflation) and the fuel price and unemployment rate from our original data:
import statsmodels.api as sm
def regression_model(df, col1, col2):
    X = sm.add_constant(df[col1])
    model = sm.OLS(df[col2], X).fit()
    print(model.summary())
    df = df.copy()  # work on a copy so adding the prediction column cannot raise SettingWithCopyWarning
    df[f'Predicted_{col2}'] = model.predict(X)
    plt.figure(figsize=(10, 6))
    plt.plot(df[col1], df[col2], 'o', label='Data')
    plt.plot(df[col1], df[f'Predicted_{col2}'], 'r--', label='Regression Line')
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.title(f'Relationship between {col1} and {col2}')
    plt.legend()
    plt.show()
regression_model(df_merged_fp, 'Inflation', 'Fuel_Price')
regression_model(df_merged_un, 'Inflation', 'Unemployment')
OLS Regression Results
==============================================================================
Dep. Variable: Fuel_Price R-squared: 0.933
Model: OLS Adj. R-squared: 0.932
Method: Least Squares F-statistic: 1416.
Date: Thu, 09 May 2024 Prob (F-statistic): 1.29e-61
Time: 17:04:33 Log-Likelihood: 86.924
No. Observations: 104 AIC: -169.8
Df Residuals: 102 BIC: -164.6
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.3627 0.026 90.145 0.000 2.311 2.415
Inflation 0.3715 0.010 37.625 0.000 0.352 0.391
==============================================================================
Omnibus: 15.256 Durbin-Watson: 0.107
Prob(Omnibus): 0.000 Jarque-Bera (JB): 16.960
Skew: 0.908 Prob(JB): 0.000208
Kurtosis: 3.786 Cond. No. 7.52
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: Unemployment R-squared: 0.936
Model: OLS Adj. R-squared: 0.935
Method: Least Squares F-statistic: 1097.
Date: Thu, 09 May 2024 Prob (F-statistic): 1.60e-46
Time: 17:04:33 Log-Likelihood: 143.37
No. Observations: 77 AIC: -282.7
Df Residuals: 75 BIC: -278.1
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 8.6170 0.010 853.740 0.000 8.597 8.637
Inflation -0.1338 0.004 -33.124 0.000 -0.142 -0.126
==============================================================================
Omnibus: 0.459 Durbin-Watson: 0.381
Prob(Omnibus): 0.795 Jarque-Bera (JB): 0.588
Skew: 0.162 Prob(JB): 0.745
Kurtosis: 2.720 Cond. No. 6.59
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This regression supports our idea and shows a clear correlation between inflation and the fuel price (positive), and between inflation and the unemployment rate (negative). These results are logical and consistent with our expectations.
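Using the fitted coefficients from the fuel-price OLS summary above, the regression line can be evaluated at a given inflation level; the 3% input below is just an example point, not a value from the dataset:

```python
# Coefficients reported in the OLS summary above (const, Inflation slope)
const, slope = 2.3627, 0.3715

inflation = 3.0  # percent, example input
predicted_fuel_price = const + slope * inflation
print(round(predicted_fuel_price, 4))  # 3.4772
```

Each additional point of inflation thus corresponds to roughly a 0.37 increase in the fitted fuel price.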
Application of a SARIMAX model:
df_merged = pd.merge(data, new_df, on='Date', how='outer').sort_values(by='Date')
df_merged.interpolate(method='linear', inplace=True)
df_merged.dropna(subset=['Inflation', 'Fuel_Price'], inplace=True)
df_merged.dropna(subset=['Inflation', 'Unemployment'], inplace=True)
date_beg = '2010-02-05'
date_end = '2012-10-31'
df_merged['Date'] = pd.to_datetime(df_merged['Date'])
date_range = pd.date_range(start=date_beg, end=date_end, freq='7D')
df_complete = pd.merge(df_merged, pd.DataFrame({'Date': date_range}), on='Date', how='right')
df_complete.sort_values(by='Date', inplace=True)
df_complete.reset_index(drop=True, inplace=True)
rng = pd.date_range(start='02-05-2010', end='10-31-2012', freq='W')
df_complete = pd.DataFrame(df_complete)
df_complete.set_index(rng, inplace=True)
df_selected = df_complete[['Date', 'Inflation', 'Fuel_Price']]
df_selected.set_index(rng, inplace=True)
df_selected = df_selected.dropna()
Using the Fuel Price variable
order = (1, 0, 0)
seasonal_order = (0, 1, 1, 12)
endog = np.asarray(df_selected['Fuel_Price'])
exog = np.asarray(df_selected[['Inflation']])
model = sm.tsa.SARIMAX(endog, exog=exog, order=order, seasonal_order=seasonal_order)
results = model.fit(disp=False)
print(results.summary())
UserWarning: Non-invertible starting seasonal moving average. Using zeros as starting parameters.
SARIMAX Results
============================================================================================
Dep. Variable: y No. Observations: 143
Model: SARIMAX(1, 0, 0)x(0, 1, [1], 12) Log Likelihood 208.121
Date: Thu, 09 May 2024 AIC -408.242
Time: 17:05:25 BIC -396.742
Sample: 0 HQIC -403.569
- 143
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.2025 0.046 4.430 0.000 0.113 0.292
ar.L1 0.9997 0.007 140.910 0.000 0.986 1.014
ma.S.L12 -0.9625 0.443 -2.171 0.030 -1.832 -0.093
sigma2 0.0020 0.001 2.164 0.030 0.000 0.004
===================================================================================
Ljung-Box (L1) (Q): 46.79 Jarque-Bera (JB): 0.84
Prob(Q): 0.00 Prob(JB): 0.66
Heteroskedasticity (H): 2.63 Skew: 0.03
Prob(H) (two-sided): 0.00 Kurtosis: 2.61
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
results.plot_diagnostics(figsize=(15, 12))
plt.show()
# Forecasting
forecast_steps = 143
forecast = results.get_forecast(steps=forecast_steps, exog=exog)
forecast_values = forecast.predicted_mean
forecast_conf_int = pd.DataFrame(forecast.conf_int())
forecast_conf_int.set_index(df_selected.index, inplace=True)
forecast_df = pd.DataFrame(forecast_values, index=df_selected.index)
# Plot original and forecast
plt.figure(figsize=(10, 5))
plt.plot(df_selected["Fuel_Price"], label='Original Series')
plt.plot(forecast_df, color='red', label='Forecast')
plt.fill_between(forecast_conf_int.index, forecast_conf_int.iloc[:, 0], forecast_conf_int.iloc[:, 1], color='pink', alpha=0.3, label='Confidence Interval')
plt.title('Fuel Price Forecast')
plt.xlabel('Date')
plt.ylabel('Fuel Price')
plt.legend()
plt.show()
# Evaluate forecast accuracy
actuals = df_selected['Fuel_Price'][-forecast_steps:]
mae = mean_absolute_error(actuals, forecast_values)
mse = mean_squared_error(actuals, forecast_values)
rmse = np.sqrt(mse)
print("Forecast Accuracy Measures:")
print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
Forecast Accuracy Measures:
MAE: 0.8458238368744987
MSE: 0.7542514868360004
RMSE: 0.8684765321158657
Based on this analysis, our model fits the data and the added exogenous variable reasonably well. The forecast accuracy measures (MAE, MSE, and RMSE) give insight into the performance of the prediction model: despite some level of error, it demonstrates a reasonable ability to predict values from the given data and exogenous variable.
Using the Unemployment variable
df_selected2 = df_complete[['Date', 'Inflation', 'Unemployment']]
df_selected2.set_index(rng, inplace=True)
df_selected2 = df_selected2.dropna()
order = (1, 0, 0)
seasonal_order = (1, 1, 1, 12)
endog2 = np.asarray(df_selected2['Unemployment'])
exog2 = np.asarray(df_selected2[['Inflation']])
model = sm.tsa.SARIMAX(endog2, exog=exog2, order=order, seasonal_order=seasonal_order)
results = model.fit(disp=False)
print(results.summary())
SARIMAX Results
============================================================================================
Dep. Variable: y No. Observations: 143
Model: SARIMAX(1, 0, 0)x(1, 1, [1], 12) Log Likelihood 188.709
Date: Thu, 09 May 2024 AIC -367.418
Time: 17:05:56 BIC -353.042
Sample: 0 HQIC -361.577
- 143
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.0171 0.072 0.238 0.812 -0.124 0.158
ar.L1 0.9992 0.008 126.362 0.000 0.984 1.015
ar.S.L12 0.0892 0.165 0.542 0.588 -0.233 0.412
ma.S.L12 -0.9160 0.346 -2.649 0.008 -1.594 -0.238
sigma2 0.0028 0.001 4.647 0.000 0.002 0.004
===================================================================================
Ljung-Box (L1) (Q): 1.71 Jarque-Bera (JB): 1927.60
Prob(Q): 0.19 Prob(JB): 0.00
Heteroskedasticity (H): 4.62 Skew: -3.85
Prob(H) (two-sided): 0.00 Kurtosis: 20.14
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
results.plot_diagnostics(figsize=(15, 12))
plt.show()
# Forecasting
forecast_steps = 143
forecast = results.get_forecast(steps=forecast_steps, exog=exog2)
forecast_values = forecast.predicted_mean
forecast_conf_int = pd.DataFrame(forecast.conf_int())
forecast_conf_int.set_index(df_selected2.index, inplace=True)
forecast_df = pd.DataFrame(forecast_values, index=df_selected2.index)
# Plot original and forecast
plt.figure(figsize=(10, 5))
plt.plot(df_selected2["Unemployment"], label='Original Series')
plt.plot(forecast_df, color='red', label='Forecast')
plt.fill_between(forecast_conf_int.index, forecast_conf_int.iloc[:, 0], forecast_conf_int.iloc[:, 1], color='pink', alpha=0.3, label='Confidence Interval')
plt.title('Unemployment Forecast')
plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.legend()
plt.show()
# Evaluate forecast accuracy
actuals = df_selected2['Unemployment'][-forecast_steps:]
mae = mean_absolute_error(actuals, forecast_values)
mse = mean_squared_error(actuals, forecast_values)
rmse = np.sqrt(mse)
print("Forecast Accuracy Measures:")
print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
Forecast Accuracy Measures:
MAE: 1.7840589351319822
MSE: 3.202114894278942
RMSE: 1.7894454152834454
In the same way, the model fits our data reasonably well, although its error metrics are higher than those obtained with the Fuel Price variable.